I. Divergences and divergence statistics

Many of the divergence measures used in statistics are of the $f$-divergence type introduced independently by I. Csiszár [1], T. Morimoto [2], and Ali and Silvey [3]. Such divergence measures have been studied in great detail in [4]. Often one is interested in inequalities for one $f$-divergence in terms of another $f$-divergence. Such inequalities are, for instance, needed in order to calculate the relative efficiency of two $f$-divergences when used for testing goodness of fit, but there are many other applications. In this paper we shall study the more general problem of determining the joint range of any pair of $f$-divergences. The results are useful in determining general conditions under which information divergence is a more efficient statistic for testing goodness of fit than another $f$-divergence, but this application will not be discussed in this short paper.

Let $f : (0, \infty) \to \mathbb{R}$ denote a convex function satisfying $f(1) = 0$. We define $f(0)$ as the limit $\lim_{t \to 0} f(t)$. We define $f^*(t) = t f(t^{-1})$. Then $f^*$ is a convex function and $f^*(0)$ is defined as $\lim_{t \to 0} t f(t^{-1}) = \lim_{t \to \infty} f(t)/t$. Assume that $P$ and $Q$ are absolutely continuous with respect to a measure $\mu$, and that $p = dP/d\mu$ and $q = dQ/d\mu$. For arbitrary distributions $P$ and $Q$ the $f$-divergence $D_f(P,Q) \geq 0$ is defined by the formula

$$D_f(P,Q) = \int_{\{q > 0\}} f\left(\frac{p}{q}\right) dQ + f^*(0)\, P(q = 0) \tag{1}$$

(for details about the definition (1) and properties of the $f$-divergences, see 0, H or 0). With this definition

$$D_f(P,Q) = D_{f^*}(Q,P).$$
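For distributions on a finite set, the definition of $D_f$ and the identity $D_f(P,Q) = D_{f^*}(Q,P)$ can be checked numerically. The following is a minimal sketch and not part of the original paper: the helper name `f_divergence`, the choice $f(t) = t \ln t - t + 1$, and the test vectors are illustrative assumptions, and the limits $f(0)$ and $f^*(0)$ are supplied by hand.

```python
import math

def f_divergence(f, f_star_at_0, p, q):
    """D_f(P, Q) for probability vectors p, q (finite-set version of the definition).

    f must return the limit f(0) when called with 0.0;
    f_star_at_0 is the limit f*(0) = lim f(t)/t for t -> infinity.
    """
    total = 0.0
    p_mass_off_q = 0.0  # P(q = 0)
    for pj, qj in zip(p, q):
        if qj > 0:
            total += qj * f(pj / qj)
        else:
            p_mass_off_q += pj
    if p_mass_off_q > 0:  # guard avoids evaluating inf * 0
        total += f_star_at_0 * p_mass_off_q
    return total

def f(t):
    # f(t) = t ln t - t + 1, with f(0) = 1 as the limit
    return 1.0 if t == 0 else t * math.log(t) - t + 1

def f_star(t):
    # f*(t) = t f(1/t) = -ln t + t - 1, with f*(0) = infinity
    return math.inf if t == 0 else -math.log(t) + t - 1

# Illustrative vectors: P puts no mass on the third point.
P = [0.2, 0.8, 0.0]
Q = [0.5, 0.3, 0.2]

lhs = f_divergence(f, math.inf, P, Q)   # uses f*(0) = inf
rhs = f_divergence(f_star, 1.0, Q, P)   # uses (f*)*(0) = f(0) = 1
print(lhs, rhs)
```

The two printed numbers should agree up to rounding. Passing the boundary limits explicitly keeps the sketch honest about the cases $q = 0$ and $p = 0$, which a naive sum would mishandle.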



Example 1: The function $f(t) = |t - 1|$ defines the $L^1$-distance

$$\|P - Q\| = \sum_{j=1}^{k} |p_j - q_j| \quad \text{(cf. O)} \tag{2}$$

which plays an important role in information theory and mathematical statistics (cf. Q or [8]).

In (1) the convex function $f$ is often taken to be one of the power functions $\phi_\alpha$ of order $\alpha \in \mathbb{R}$, given in the domain $t > 0$ by the formula

$$\phi_\alpha(t) = \frac{t^\alpha - \alpha(t - 1) - 1}{\alpha(\alpha - 1)} \quad \text{when } \alpha(\alpha - 1) \neq 0 \tag{3}$$

and by the corresponding limits

$$\phi_0(t) = -\ln t + t - 1 \quad \text{and} \quad \phi_1(t) = t \ln t - t + 1. \tag{4}$$

The $\phi$-divergences

$$D_\alpha(P,Q) \overset{\text{def}}{=} D_{\phi_\alpha}(P,Q), \quad \alpha \in \mathbb{R} \tag{5}$$

based on (3) and (4) are usually referred to as power divergences of orders $\alpha$. For details about the properties of power divergences, see Q or [6]. Next we mention the best known members of the family of statistics (5), with a reference to the skew symmetry $D_\alpha(P,Q) = D_{1-\alpha}(Q,P)$ of the power divergences (5).

Fig. 1. The joint range of total variation V and information D as determined in [8]. It was also proved that any point in the range
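The power functions and the skew symmetry $D_\alpha(P,Q) = D_{1-\alpha}(Q,P)$ described above lend themselves to a quick numerical check. The sketch below is illustrative only: the function names `phi` and `power_divergence` and the test vectors are assumptions, and the code is restricted to strictly positive probability vectors so that no boundary limits are needed.

```python
import math

def phi(alpha, t):
    """Power function phi_alpha(t), with the limiting cases alpha = 0 and alpha = 1."""
    if alpha == 0:
        return -math.log(t) + t - 1
    if alpha == 1:
        return t * math.log(t) - t + 1
    return (t ** alpha - alpha * (t - 1) - 1) / (alpha * (alpha - 1))

def power_divergence(alpha, p, q):
    """D_alpha(P, Q) for strictly positive probability vectors p, q."""
    return sum(qj * phi(alpha, pj / qj) for pj, qj in zip(p, q))

# Illustrative strictly positive distributions on three points.
P = [0.2, 0.5, 0.3]
Q = [0.4, 0.4, 0.2]

for alpha in (-1, 0, 0.5, 1, 2, 3):
    left = power_divergence(alpha, P, Q)
    right = power_divergence(1 - alpha, Q, P)
    print(alpha, left, right)
```

Each printed row should show two values that agree up to rounding. For $\alpha = 2$ the value also equals $\frac{1}{2}\sum_j (p_j - q_j)^2 / q_j$, since $\phi_2(t) = (t-1)^2/2$.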

Example 2: The $\chi^2$-divergence or quadratic divergence

$$D_2(P,Q) = D_{-1}(Q,P) = \frac{1}{2} \sum_{j=1}^{k} \frac{(p_j - q_j)^2}{q_j} \tag{6}$$

leads to the well known Pearson and Neyman statistics. The information divergence

$$D_1(P,Q) = D_0(Q,P) = \sum_{j=1}^{k} p_j \ln \frac{p_j}{q_j} \tag{7}$$

leads to the log-likelihood ratio and reversed log-likelihood ratio statistics. The symmetric Hellinger divergence

$$D_{1/2}(P,Q) = D_{1/2}(Q,P) = H(P,Q)$$

leads to the Freeman-Tukey statistic.

Fig. 2. Joint range of total variation and Jensen-Shannon divergence. The 2-point achievable pairs have dark shading and the 3-point achievable pairs have light shading.

Example 3: The Hellinger divergence and the total variation are symmetric in the arguments $P$ and $Q$. Non-symmetric divergences may be symmetrized. For instance the LeCam divergence is nothing but the symmetrized Pearson divergence given by

$$D_{\mathrm{LeCam}}(P,Q) = \frac{1}{2} D_2\left(P, \frac{P+Q}{2}\right) + \frac{1}{2} D_2\left(Q, \frac{P+Q}{2}\right).$$

Another symmetrized divergence is the Jensen-Shannon divergence defined by

$$JD_1(P,Q) = \frac{1}{2} D_1\left(P, \frac{P+Q}{2}\right) + \frac{1}{2} D_1\left(Q, \frac{P+Q}{2}\right).$$

The joint range of total variation with Jensen-Shannon divergence was studied by Briet and Harremoes [9] and is illustrated in Figure 2.
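Symmetrization around the midpoint $(P+Q)/2$ can be illustrated on a finite set. The sketch below assumes the usual midpoint form $\frac{1}{2} D(P, M) + \frac{1}{2} D(Q, M)$ with $M = (P+Q)/2$ for both the LeCam and the Jensen-Shannon divergence; the helper names and the test vectors are hypothetical, and natural logarithms are used.

```python
import math

def d2(p, q):
    """Quadratic divergence D_2(P, Q) = (1/2) sum (p_j - q_j)^2 / q_j."""
    return 0.5 * sum((pj - qj) ** 2 / qj for pj, qj in zip(p, q))

def d1(p, q):
    """Information divergence D_1(P, Q) = sum p_j ln(p_j / q_j), in nats."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def symmetrize(div, p, q):
    """(1/2) div(P, M) + (1/2) div(Q, M) with midpoint M = (P + Q) / 2."""
    m = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    return 0.5 * div(p, m) + 0.5 * div(q, m)

# Illustrative distributions on three points.
P = [0.2, 0.5, 0.3]
Q = [0.4, 0.4, 0.2]

le_cam = symmetrize(d2, P, Q)
jensen_shannon = symmetrize(d1, P, Q)
print(le_cam, jensen_shannon)
```

Under this midpoint convention the LeCam value reduces to $\frac{1}{4} \sum_j (p_j - q_j)^2 / (p_j + q_j)$, and the Jensen-Shannon value never exceeds $\ln 2$ nats, that is, 1 bit.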

In this paper we shall prove that the joint range of any pair of $f$-divergences is essentially determined by the range of distributions on a two-element set. In special cases the significance of determining the range over two-element sets has been pointed out explicitly in [10]. Here we shall prove that a reduction to two-element sets can always be made.

II. Joint range of $f$-divergences

In this section we are interested in the range of the map $(P, Q) \to (D_f(P,Q), D_g(P,Q))$ where $P$ and $Q$ are probability distributions on the same set.



Definition 4: A point $(x, y) \in \mathbb{R}^2$ is an $(f, g)$-divergence pair if there exists a Borel space $(\mathcal{X}, \mathcal{F})$ with probability measures $P$ and $Q$ such that $(x, y) = (D_f(P,Q), D_g(P,Q))$. An $(f, g)$-divergence pair $(x, y)$ is achievable in $\mathbb{R}^d$ if there exist probability vectors $P, Q \in \mathbb{R}^d$ such that

$$(x, y) = (D_f(P,Q), D_g(P,Q)).$$

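Lemma 5: Assume that

$$P_0(A) = Q_0(A) = 1 \quad \text{and} \quad P_1(B) = Q_1(B) = 1$$

and that $A \cap B = \emptyset$. If $P_\alpha = (1 - \alpha) P_0 + \alpha P_1$ and $Q_\alpha = (1 - \alpha) Q_0 + \alpha Q_1$ then

$$D_f(P_\alpha, Q_\alpha) = (1 - \alpha) D_f(P_0, Q_0) + \alpha D_f(P_1, Q_1).$$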

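Theorem 6: The set of $(f, g)$-divergence pairs is convex.

Proof: Assume that $(P_0, Q_0)$ and $(P_1, Q_1)$ are two pairs of probability distributions on a space $(\mathcal{X}, \mathcal{F})$. Introduce a two-element set $B = \{0, 1\}$ and the product space $\mathcal{X} \times B$ as a measurable space. Let $\phi$ denote projection on $B$. Now we define a pair $(\tilde{P}, \tilde{Q})$ of joint distributions on $\mathcal{X} \times B$. The marginal distribution of both $\tilde{P}$ and $\tilde{Q}$ on $B$ is $(1 - \alpha, \alpha)$. The conditional distributions are given by $\tilde{P}(\cdot \mid \phi = i) = P_i$ and $\tilde{Q}(\cdot \mid \phi = i) = Q_i$ where $i = 0, 1$. Since the two conditional pairs live on disjoint subsets of $\mathcal{X} \times B$, Lemma 5 gives

$$\begin{pmatrix} D_f(\tilde{P}, \tilde{Q}) \\ D_g(\tilde{P}, \tilde{Q}) \end{pmatrix} = \begin{pmatrix} (1 - \alpha) D_f(P_0, Q_0) + \alpha D_f(P_1, Q_1) \\ (1 - \alpha) D_g(P_0, Q_0) + \alpha D_g(P_1, Q_1) \end{pmatrix} = (1 - \alpha) \begin{pmatrix} D_f(P_0, Q_0) \\ D_g(P_0, Q_0) \end{pmatrix} + \alpha \begin{pmatrix} D_f(P_1, Q_1) \\ D_g(P_1, Q_1) \end{pmatrix}.$$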
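The construction in the proof can be mimicked on finite sets by concatenating weighted probability vectors, which places the two pairs on disjoint supports. The sketch below is illustrative and not taken from the paper: the helper names, the choice of total variation and information divergence as $(f, g)$, and the numerical values are assumptions, and all vectors are strictly positive.

```python
import math

def divergence(f, p, q):
    """D_f(P, Q) for strictly positive probability vectors p, q."""
    return sum(qj * f(pj / qj) for pj, qj in zip(p, q))

def f(t):
    return abs(t - 1)                # generates total variation

def g(t):
    return t * math.log(t) - t + 1   # generates information divergence

def mix_on_disjoint_supports(alpha, v0, v1):
    """Concatenate (1 - alpha) * v0 and alpha * v1: a mixture on the disjoint union."""
    return [(1 - alpha) * x for x in v0] + [alpha * x for x in v1]

# Two illustrative pairs of distributions, on 2 and 3 points respectively.
P0, Q0 = [0.5, 0.5], [0.9, 0.1]
P1, Q1 = [0.2, 0.3, 0.5], [0.3, 0.3, 0.4]
alpha = 0.3

P_mix = mix_on_disjoint_supports(alpha, P0, P1)
Q_mix = mix_on_disjoint_supports(alpha, Q0, Q1)

for h in (f, g):
    mixed = divergence(h, P_mix, Q_mix)
    combo = (1 - alpha) * divergence(h, P0, Q0) + alpha * divergence(h, P1, Q1)
    print(mixed, combo)
```

Both printed rows should show matching values, so the divergence pair of the mixed distributions is the convex combination of the two original pairs. Note that the mixture lives on $2 + 3 = 5$ points, which is why convexity alone does not bound the number of points needed.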



Example 7: For the joint range of total variation and Jensen-Shannon divergence illustrated in Figure 2, the set of pairs achievable in $\mathbb{R}^2$ is not convex, but the set of pairs achievable in $\mathbb{R}^3$ is convex and equals the set of all $(f, g)$-divergence pairs.

Theorem 8: Any $(f, g)$-divergence pair is a convex combination of two $(f, g)$-divergence pairs, both of them achievable in $\mathbb{R}^2$. Consequently, any $(f, g)$-divergence pair is achievable in $\mathbb{R}^4$.

Proof: Let $P$ and $Q$ denote probability measures on the same measurable space. Define the set $A = \{q > 0\}$ and the function $X = p/q$ on $A$. Then $Q$ satisfies

$$Q(A) = 1, \qquad \int_A X \, dQ \leq 1. \tag{8}$$

Now we fix $X$ and $A$. The formulas for the divergences become

$$\begin{aligned} D_f(P,Q) &= \int_A f(X) \, dQ + f^*(0) P(\complement A) \\ &= \int_A f(X) \, dQ + f^*(0) \left(1 - \int_A X \, dQ\right) \\ &= \int_A \left(f(X) + f^*(0)(1 - X)\right) dQ \\ &= E\left[f(X) + f^*(0)(1 - X)\right] \end{aligned}$$

and similarly

$$D_g(P,Q) = E\left[g(X) + g^*(0)(1 - X)\right].$$

Hence, the divergences only depend on the distribution of $X$. Therefore we may without loss of generality assume that $Q$ is a probability measure on $[0, \infty)$.
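The rewriting of $D_f(P,Q)$ as an expectation of $f(X) + f^*(0)(1 - X)$ under $Q$ can be checked on a small example where $P$ puts mass outside the support of $Q$. This is a sketch under stated assumptions: the function names and vectors are hypothetical, and only functions with finite $f^*(0)$ are used, because the pointwise expression is not meaningful in floating point when $f^*(0) = \infty$.

```python
def divergence_by_definition(f, f_star_at_0, p, q):
    """Integral of f(p/q) over {q > 0} plus f*(0) * P(q = 0)."""
    total = sum(qj * f(pj / qj) for pj, qj in zip(p, q) if qj > 0)
    total += f_star_at_0 * sum(pj for pj, qj in zip(p, q) if qj == 0)
    return total

def divergence_by_expectation(f, f_star_at_0, p, q):
    """Q-average of f(X) + f*(0) * (1 - X), where X = p/q on A = {q > 0}."""
    return sum(qj * (f(pj / qj) + f_star_at_0 * (1 - pj / qj))
               for pj, qj in zip(p, q) if qj > 0)

def total_variation(t):
    return abs(t - 1)                # f*(0) = 1

def hellinger(t):
    return 2 * (t ** 0.5 - 1) ** 2   # phi_{1/2}, with f*(0) = 2

# Q is supported on the first two points; P has mass 0.6 outside that support.
P = [0.1, 0.3, 0.6]
Q = [0.5, 0.5, 0.0]

for h, h_star_at_0 in ((total_variation, 1.0), (hellinger, 2.0)):
    print(divergence_by_definition(h, h_star_at_0, P, Q),
          divergence_by_expectation(h, h_star_at_0, P, Q))
```

In this example $X$ takes the values 0.2 and 0.6 with $Q$-probability 1/2 each, so $E[X] = 0.4 < 1$, and the deficit $1 - E[X] = 0.6$ is exactly the mass of $P$ outside the support of $Q$.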

Define $C$ as the set of probability measures on $[0, \infty)$ satisfying $E[X] \leq 1$. Let $C^+$ be the set of additive measures $\mu$ on $[0, \infty)$ satisfying $\mu(A) \leq 1$ and $\int X \, d\mu \leq 1$. Then $C^+$ is convex and thus compact under setwise convergence. According to the Choquet-Bishop-de Leeuw theorem [11, Sec. 4] any point in $C^+$ is the barycenter of a probability measure over the extreme points of $C^+$. In particular an element $Q \in C$ is the barycenter of a probability measure $P_{\mathrm{bary}}$ over extreme points of $C^+$, and these extreme points must in addition be probability measures with $P_{\mathrm{bary}}$-probability 1. Hence $Q \in C$ is a barycenter of a probability measure over extreme points in $C$.

Let $Q$ be an element in $C$. Let $A_i$, $i = 1, 2, 3$ be a disjoint cover of $[0, \infty)$ and assume that $Q(A_i) > 0$. Then

$$Q = \sum_{i=1}^{3} Q(A_i) \, Q(\cdot \mid A_i).$$

For a probability vector $\lambda = (\lambda_1, \lambda_2, \lambda_3)$ let $Q_\lambda$ denote the distribution

$$Q_\lambda = \sum_{i=1}^{3} \lambda_i \, Q(\cdot \mid A_i).$$

Then $Q_\lambda$ is an element in $C$ if and only if

$$\sum_{i=1}^{3} \lambda_i \int X \, dQ(\cdot \mid A_i) \leq 1. \tag{9}$$

An extreme probability vector $\lambda$ that satisfies (9) has one or two of its weights equal to 0. Hence, if $Q$ is extreme in $C$ and $A_i$, $i = 1, 2, 3$ is a disjoint cover of $A$, then at least one of the three sets satisfies $Q(A_i) = 0$. Therefore an extreme point $Q \in C$ is of one of the following two types:

1) $Q$ is concentrated in one point.

2) $Q$ has support on two points. In this case the inequality $\int_A X \, dQ \leq 1$ holds with equality and $P(A) = 1$, so that $P$ is absolutely continuous with respect to $Q$ and therefore supported by the same two-element set.

The formulas for divergence are linear in $Q$. Hence any $(f, g)$-divergence pair is the barycenter of a probability measure $P_{\mathrm{bary}}$ over pairs generated by extreme distributions $Q \in C$. The extreme distributions of type 2 generate pairs achievable in $\mathbb{R}^2$.

Fig. 3. The slashed curve connects $\mathbf{y}_1$ and $\mathbf{y}_2$. The lines $\ell_1$ and $\ell_2$ are not illustrated.

For extreme points $Q$ concentrated in a single point we can reverse the argument and make a barycentric decomposition with respect to $P$. If an extreme $P$ has a two-point support then $Q$ is absolutely continuous with respect to $P$ and generates an $(f, g)$-divergence pair achievable in $\mathbb{R}^2$. If $P$ is concentrated in a point then this point may either be identical with the support of $Q$, in which case the two probability measures are identical, or the support points are different, in which case $P$ and $Q$ are singular but $(P, Q)$ is still supported on two points. Therefore any $(f, g)$-divergence pair has a barycentric decomposition into pairs achievable in $\mathbb{R}^2$.

Let $\mathbf{y} = (y, z)$ be an $(f, g)$-divergence pair. As we have seen, $\mathbf{y}$ is a barycenter of $(f, g)$-divergence pairs achievable in $\mathbb{R}^2$. According to Carathéodory's theorem [12] any barycentric decomposition in two dimensions may be obtained as a convex combination of at most three points $\mathbf{y}_i$, $i = 1, 2, 3$, as illustrated in Figure 3. Assume that all three points have positive weight. Let $\ell_i$ be the line through $\mathbf{y}$ and $\mathbf{y}_i$. The point $\mathbf{y}$ divides the line $\ell_i$ into two half-lines $\ell_i^-$ and $\ell_i^+$, where $\ell_i^-$ denotes the half-line that contains $\mathbf{y}_i$. The half-lines $\ell_i^+$, $i = 1, 2, 3$ divide $\mathbb{R}^2$ into three sectors, each of them containing one of the points $\mathbf{y}_i$, $i = 1, 2, 3$. The set of $(f, g)$-divergence pairs achievable in $\mathbb{R}^2$ is curve-connected, so there exists a continuous curve of $(f, g)$-divergence pairs achievable in $\mathbb{R}^2$ from $\mathbf{y}_1$ to $\mathbf{y}_2$ that must intersect $\ell_1^+ \cup \ell_3^+$ in a point $\mathbf{z}$. If $\mathbf{z}$ lies on $\ell_i^+$ then $\mathbf{y}$ is a convex combination of the two points $\mathbf{y}_i$ and $\mathbf{z}$. Hence, any $(f, g)$-divergence pair is a convex combination of two points that are $(f, g)$-divergence pairs achievable in $\mathbb{R}^2$. From the construction in the proof of Theorem 6 we see that any $(f, g)$-divergence pair is achievable in $\mathbb{R}^4$. ■

Remark 9: We do not have any example of functions $(f, g)$ such that the set of pairs achievable in $\mathbb{R}^3$ is not convex.

Remark 10: An $f$-divergence on an arbitrary $\sigma$-algebra can be approximated by the $f$-divergence on its finite sub-algebras. Any finite $\sigma$-algebra is a Borel $\sigma$-algebra for a discrete space, so for probability measures $P, Q$ on a $\sigma$-algebra the point $(D_f(P,Q), D_g(P,Q))$ is in the closure of the pairs achievable in $\mathbb{R}^4$. For many function pairs $(f, g)$ the set of pairs achievable in $\mathbb{R}^2$ is closed, and then the set of all $(f, g)$-divergence pairs is closed and contains $(D_f(P,Q), D_g(P,Q))$ even if $P, Q$ are measures on a non-atomic $\sigma$-algebra.

The set of $(f, g)$-divergence pairs that are achievable in $\mathbb{R}^2$ can be parametrized by $P = (1 - p, p)$ and $Q = (1 - q, q)$. If we define $\overline{(1 - p, p)} = (p, 1 - p)$ then $D_f(P,Q) = D_f(\overline{P}, \overline{Q})$. Hence we may assume without loss of generality that $p \leq q$, and we just have to determine the image of the simplex $\Delta = \{(p, q) \mid 0 \leq p \leq q \leq 1\}$. This result makes it very easy to make a numerical plot of the $(f, g)$-divergence pairs achievable in $\mathbb{R}^2$, and the joint range is just the convex hull.
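The numerical recipe described above (scan the triangle, then take the convex hull) can be sketched in a few lines. Everything below is an illustrative assumption rather than the authors' code: the pair $(f, g)$ is taken to be total variation and Hellinger divergence because both stay finite, the grid omits $q \in \{0, 1\}$, and the hull of a finite grid only approximates the true joint range.

```python
def pair_on_two_points(f, g, p, q):
    """(D_f, D_g) for P = (1 - p, p) and Q = (1 - q, q), assuming 0 < q < 1."""
    def d(h):
        return (1 - q) * h((1 - p) / (1 - q)) + q * h(p / q)
    return (d(f), d(g))

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for pt in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], pt) <= 0:
            lower.pop()
        lower.append(pt)
    for pt in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], pt) <= 0:
            upper.pop()
        upper.append(pt)
    return lower[:-1] + upper[:-1]

def f(t):
    return abs(t - 1)                # total variation

def g(t):
    return 2 * (t ** 0.5 - 1) ** 2   # Hellinger divergence, phi_{1/2}

n = 200  # grid resolution (an arbitrary choice)
points = [pair_on_two_points(f, g, i / n, j / n)
          for j in range(1, n) for i in range(0, j + 1)]   # 0 <= p <= q < 1
hull = convex_hull(points)
print(len(points), len(hull))
```

The list `points` approximates the pairs achievable in $\mathbb{R}^2$ and `hull` approximates the boundary of the joint range. For functions with $f(0) = \infty$ or $f^*(0) = \infty$ the behaviour near the boundary of the triangle has to be examined separately.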

III. Image of the triangle 

In order to determine the image of the triangle $\Delta$ we have to check what happens at inner points and what happens at or near the boundary. Most inner points are mapped into inner points of the range. On subsets of $\Delta$ where the derivative matrix is non-singular the mapping $(P, Q) \to (D_f, D_g)$ is open according to the open mapping theorem from calculus. Hence, all inner points that are not mapped into interior points of the range must satisfy

$$\begin{vmatrix} \dfrac{\partial D_f}{\partial p} & \dfrac{\partial D_g}{\partial p} \\ \dfrac{\partial D_f}{\partial q} & \dfrac{\partial D_g}{\partial q} \end{vmatrix} = 0.$$

Depending on the functions $f$ and $g$ this equation may be easy or difficult to solve, but in most cases the solutions will lie on a 1-dimensional manifold that will cut the triangle $\Delta$ into pieces, such that each piece is mapped isomorphically into subsets of the range of $(P, Q) \to (D_f, D_g)$. Each pair of functions $(f, g)$ will require its own analysis.

The diagonal $p = q$ in $\Delta$ is easy to analyze. It is mapped into $(D_f, D_g) = (0, 0)$.

Lemma 11: If $f(0) = \infty$ and $\liminf_{t \to 0} \frac{g(t)}{f(t)} = \beta_0$, then the supremum of

$$\beta \cdot D_f(P,Q) - D_g(P,Q)$$

over all distributions $P, Q$ is $\infty$ if $\beta > \beta_0$.

If $f^*(0) = \infty$ and $\liminf_{t \to \infty} \frac{g(t)}{f(t)} = \beta_0$, then the supremum of

$$\beta \cdot D_f(P,Q) - D_g(P,Q)$$

over all distributions $P, Q$ is $\infty$ if $\beta > \beta_0$.

If $g(0) = \infty$ and $\limsup_{t \to 0} \frac{g(t)}{f(t)} = \gamma_0$, then the supremum of

$$D_g(P,Q) - \gamma D_f(P,Q)$$

over all distributions $P, Q$ is $\infty$ if $\gamma < \gamma_0$.

If $g^*(0) = \infty$ and $\limsup_{t \to \infty} \frac{g(t)}{f(t)} = \gamma_0$, then the supremum of

$$D_g(Q,P) - \gamma D_f(Q,P)$$

over all distributions $P, Q$ is $\infty$ if $\gamma < \gamma_0$.



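Proof: Assume that

$$f(0) = \infty \quad \text{and} \quad \liminf_{t \to 0} \frac{g(t)}{f(t)} = \beta_0 > 0.$$

The first condition implies

$$D_f((1, 0), (1/2, 1/2)) = \infty$$

and the second condition implies that $g(0) = \infty$ and

$$D_g((1, 0), (1/2, 1/2)) = \infty.$$

We have

$$\frac{D_g((p, 1-p), (1/2, 1/2))}{D_f((p, 1-p), (1/2, 1/2))} = \frac{g(2p)/2 + g(2(1-p))/2}{f(2p)/2 + f(2(1-p))/2} = \frac{g(2p) + g(2(1-p))}{f(2p) + f(2(1-p))}.$$

Let $(t_n)$ be a sequence with $t_n \to 0$ such that

$$\frac{g(t_n)}{f(t_n)} \to \beta_0 \quad \text{for } n \to \infty.$$

Then

$$\frac{D_g\left(\left(\frac{t_n}{2}, 1 - \frac{t_n}{2}\right), (1/2, 1/2)\right)}{D_f\left(\left(\frac{t_n}{2}, 1 - \frac{t_n}{2}\right), (1/2, 1/2)\right)} \to \beta_0$$

and the first result follows.

The other three cases follow by interchanging $f$ and $g$, and/or replacing $f$ by $f^*$ and $g$ by $g^*$. We have used that

$$\liminf_{t \to 0} \frac{g^*(t)}{f^*(t)} = \liminf_{t \to 0} \frac{t g(t^{-1})}{t f(t^{-1})} = \liminf_{t \to \infty} \frac{g(t)}{f(t)}.$$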

Proposition 12: Assume that $f$ and $g$ are $C^2$ and that $f''(1) > 0$ and $g''(1) > 0$. Assume that $\liminf_{t \to 0} \frac{g(t)}{f(t)} > 0$, and that $\liminf_{t \to \infty} \frac{g(t)}{f(t)} > 0$. Then there exists $\beta > 0$ such that

$$D_g(P,Q) \geq \beta \cdot D_f(P,Q) \tag{10}$$

for all distributions $P, Q$.



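Proof: The inequality $\liminf_{t \to 0} \frac{g(t)}{f(t)} > 0$ implies that there exist $\beta_0, t_0 > 0$ such that $g(t) \geq \beta_0 f(t)$ for $t < t_0$. The inequality $\liminf_{t \to \infty} \frac{g(t)}{f(t)} > 0$ implies that there exist $\beta_\infty > 0$ and $t_\infty > 0$ such that $g(t) \geq \beta_\infty f(t)$ for $t > t_\infty$. According to Taylor's formula we have

$$f(t) = \frac{f''(\theta)}{2} (t - 1)^2, \qquad g(t) = \frac{g''(\eta)}{2} (t - 1)^2$$

for some $\theta$ and $\eta$ between 1 and $t$. Hence

$$\frac{g(t)}{f(t)} = \frac{g''(\eta)}{f''(\theta)} \to \frac{g''(1)}{f''(1)} \quad \text{for } t \to 1.$$

Therefore there exist $\beta_1 > 0$ and an interval $]t_-, t_+[$ around 1 such that $\frac{g(t)}{f(t)} \geq \beta_1$ for $t \in \, ]t_-, t_+[$. The function $t \mapsto \frac{g(t)}{f(t)}$ is continuous on the compact set $[t_0, t_-] \cup [t_+, t_\infty]$, so it has a minimum $\tilde{\beta} > 0$ on this set. Inequality (10) holds for $\beta = \min\{\beta_0, \beta_1, \beta_\infty, \tilde{\beta}\}$. ■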



IV. Bounds for power divergences 

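As an example we shall determine the exact range of a pair of power divergences. We have

$$f(t) = \phi_2(t), \qquad g(t) = \phi_3(t).$$

In this case we have

$$D_f((p, 1-p), (q, 1-q)) = \frac{1}{2} \left( \frac{(p - q)^2}{q} + \frac{(p - q)^2}{1 - q} \right)$$

and

$$D_g((p, 1-p), (q, 1-q)) = \frac{1}{6} \left( \left(\frac{p}{q}\right)^3 q + \left(\frac{1 - p}{1 - q}\right)^3 (1 - q) - 1 \right).$$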

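First we determine the image of the triangle. The derivatives are

$$\begin{aligned} \frac{\partial D_f}{\partial p} &= \frac{2}{2} \cdot \frac{p - q}{(1 - q) q}, \\ \frac{\partial D_f}{\partial q} &= \frac{1}{2} \cdot \frac{(2pq - q - p)(p - q)}{(1 - q)^2 q^2}, \\ \frac{\partial D_g}{\partial p} &= \frac{-3}{6} \cdot \frac{(2pq - q - p)(p - q)}{(1 - q)^2 q^2}, \\ \frac{\partial D_g}{\partial q} &= \frac{1}{3} \cdot \frac{\left(pq + p^2 + q^2 - 3pq^2 - 3p^2 q + 3p^2 q^2\right)(p - q)}{(q - 1)^3 q^3}. \end{aligned}$$

The determinant of derivatives is

$$\begin{vmatrix} \dfrac{\partial D_f}{\partial p} & \dfrac{\partial D_g}{\partial p} \\ \dfrac{\partial D_f}{\partial q} & \dfrac{\partial D_g}{\partial q} \end{vmatrix} = \frac{(p - q)^2}{12 q^4 (1 - q)^4} \begin{vmatrix} 2 & 3p + 3q - 6pq \\ 2pq - q - p & 6pq^2 - 2p^2 - 2q^2 - 2pq + 6p^2 q - 6p^2 q^2 \end{vmatrix} = -\frac{1}{12} \left( \frac{p - q}{q (1 - q)} \right)^4.$$

We see that the determinant of derivatives is different from zero for $p \neq q$, so the interior of $\Delta$ is mapped one-to-one to the image. Hence we just have to determine the image of points on the boundary of $\Delta$ (or near the boundary if the map is undefined on the boundary).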

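For $P = (1, 0)$ and $Q = (1 - q, q)$ we get

$$D_f(P,Q) = \frac{1}{2} \left( \frac{1}{1 - q} - 1 \right), \qquad D_g(P,Q) = \frac{1}{6} \left( \frac{1}{(1 - q)^2} - 1 \right) = \frac{1}{6} \cdot \frac{(2 - q) q}{(1 - q)^2}.$$

The first equation leads to

$$q = 1 - \frac{1}{2 D_f + 1}$$

and substituting this into the second equation gives

$$D_g = \frac{2}{3} D_f (D_f + 1).$$

We have

$$\frac{g(t)}{f(t)} = \frac{\left(t^3 - 3(t - 1) - 1\right)/6}{\left(t^2 - 2(t - 1) - 1\right)/2} \to \infty \quad \text{for } t \to \infty.$$

All points $(0, s)$, $s \in [0, \infty)$ are in the closure of the range of $(P, Q) \to (D_f, D_g)$. Combining these two results we see that the range consists of the point $(0, 0)$, all points on the curve $\left(x, \frac{2}{3} x (x + 1)\right)$, $x \in (0, \infty)$, and all points above this curve.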
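The boundary curve $D_3 = \frac{2}{3} D_2 (D_2 + 1)$ can be sanity-checked on two-point distributions. The sketch below is an illustration under stated assumptions: the function names `d2`, `d3`, `lower_curve` and the grid size are hypothetical choices, the formulas are the two-point expressions for $(D_2, D_3)$ with $P = (p, 1-p)$ and $Q = (q, 1-q)$, and a grid scan is of course no proof.

```python
import math

def d2(p, q):
    """D_2 for P = (p, 1 - p), Q = (q, 1 - q), with 0 < q < 1."""
    return 0.5 * ((p - q) ** 2 / q + (p - q) ** 2 / (1 - q))

def d3(p, q):
    """D_3 for P = (p, 1 - p), Q = (q, 1 - q), with 0 < q < 1."""
    return ((p / q) ** 3 * q + ((1 - p) / (1 - q)) ** 3 * (1 - q) - 1) / 6

def lower_curve(x):
    """Claimed lower boundary of the joint range: y = (2/3) x (x + 1)."""
    return 2 * x * (x + 1) / 3

n = 50  # grid resolution (an arbitrary choice)
worst_gap = min(
    d3(i / n, j / n) - lower_curve(d2(i / n, j / n))
    for i in range(0, n + 1) for j in range(1, n)
)
print("smallest value of D_3 - (2/3) D_2 (D_2 + 1) on the grid:", worst_gap)

# With P concentrated in one point (p = 1) the pair should lie on the curve.
for q in (0.1, 0.5, 0.9):
    print(q, d3(1.0, q), lower_curve(d2(1.0, q)))
```

The smallest gap should be zero up to rounding, attained where $P$ is concentrated in one point, and every other grid point should lie above the curve. This is consistent with the inequality $(E[X^2])^2 \leq E[X] \cdot E[X^3]$ for $X = p/q$ with $E[X] = 1$, which is one way to see why the curve bounds the range from below.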

Similar results hold for any pair of power divergences, but for pairs other than $(D_2, D_3)$ the computations become much more involved.

Note that the Rényi divergences are monotone functions of the power divergences, so our results easily translate into results on Rényi divergences. More details on Rényi divergences can be found in [13].